In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
from sklearn import svm, grid_search, metrics, preprocessing, linear_model, cross_validation
from sklearn.learning_curve import validation_curve, learning_curve
%matplotlib inline

## Utility functions :
def plot_PCA(X, y, classes, azimuth, elevation):
    """ X numpy array containing data, 
    y numpy array containing the labels, 
    classes a list containing the label names
    azimuth, elevation float numbers """
    nClasses = len(classes)
    fig = plt.figure(1, figsize=(10, 8))
    plt.clf()
    ax = Axes3D(fig, axisbg='white', azim=azimuth, elev=elevation)
    pca = PCA(n_components=3)
    X_redux = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)
    print("Variability of the compressed dataset %.4f" %sum(pca.explained_variance_ratio_))
    cmap = plt.cm.Paired
    colors = np.linspace(0, 1, nClasses)
    for n in range(nClasses) :
        index = np.ravel(y==(n+1))
        ax.scatter(X_redux[index, 0], X_redux[index, 1], X_redux[index, 2], c=cmap(colors[n]), label=classes[n])
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    plt.title("Data plotted on the three first components", y=1.03) # y prevent title overlap with axes
    plt.show()

def plot_validation_scores(train_scores, test_scores, param_range):
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.title("Validation Curve")
    plt.xlabel("$C$")
    plt.ylabel("F1 Score")
    #plt.ylim(0.0, 1.1)
    plt.semilogx(param_range, train_scores_mean, label="Training score", color="r", linestyle="--")
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2, color="r")
    plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
                 color="g")
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2, color="g")
    plt.legend(loc='lower right')
    plt.show()
    
def flatten(somelist):
    return([item for sublist in somelist for item in sublist])

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    from sklearn examples :
    http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html#example-plot-learning-curve-py
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc='lower right')
    plt.show()

Human Activity Recognition from smartphone sensors data

Introduction

We try to recognize activities of 30 individuals using sensor data gathered from their smartphone. Human activity recognition can be used to measure physical activity of users of a smarphone or some wearable as smartwatches. This may be used to compute how much calories one burns during the day, and to display advices if the user does not take enough exercise to remain healthy.

Dataset

We use the Human Activity Recognition Using Smartphones Dataset (Version 1.0) that can be found on UCI archive.

It is experimental data : 30 volunteers (19-48 y.o.) performed six activities (walking, walking upstairs, walking downstairs, sitting, standing, laying) wearing a smartphone (Samsung Galaxy S II) on the waist. As the smartphone was worn on the waist, there is no rotation issue as there would be in the case the smartphone is "worn" in the pocket.

Signals come from the accelerometer and the gyroscope of the smartphone (captured at a constant rate of 50Hz). They were filtered to remove noise and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The acceleration signal was separated into body and gravity acceleration (using another filter). Body acceleration (from accelerometer) and angular velocity (from accelrometer) were derived to obtain Jerk signals. Magnitude of the three dimensional signals was calculated using the euclidean norm. Fast Fourier Transform was used to compute frequency domain signals.

At the end, we get a set of 17 signals, and for each of them, several features were created :

mean: Mean value
std: Standard deviation
mad: Median absolute deviation 
max: Largest value in array
min: Smallest value in array
sma: Signal magnitude area
energy: Energy measure. Sum of the squares divided by the number of values. 
iqr: Interquartile range 
entropy: Signal entropy
arCoeff: Autorregresion coefficients with Burg order equal to 4
correlation: correlation coefficient between two signals
maxInds: index of the frequency component with largest magnitude
meanFreq: Weighted average of the frequency components to obtain a mean frequency
skewness: skewness of the frequency domain signal 
kurtosis: kurtosis of the frequency domain signal 
bandsEnergy: Energy of a frequency interval within the 64 bins of the FFT of each window.
angle: Angle between to vectors.

Additional vectors obtained by averaging the signals in a signal window sample. These are used on the angle variable. There are 561 features on the dataset :


In [2]:
!wc -l ./data/features.txt


     561 ./data/features.txt

In [3]:
!cat data/activity_labels.txt


1 WALKING
2 WALKING_UPSTAIRS
3 WALKING_DOWNSTAIRS
4 SITTING
5 STANDING
6 LAYING

y_train contains one label / line

X_train contains the 561 features / line

The whole thing should fit in a pandas dataframe as it is 63Mo :


In [71]:
!ls -lh data/train/


total 256304
drwxr-xr-x@ 11 maryanmorel  staff   374B Nov 26 22:20 Inertial Signals
-rwxr-xr-x@  1 maryanmorel  staff    63M Nov 26 22:20 X_train.txt
-rw-r--r--   1 maryanmorel  staff    62M Dec 11 21:49 X_train_clean.txt
-rwxr-xr-x@  1 maryanmorel  staff    20K Nov 26 22:20 subject_train.txt
-rwxr-xr-x@  1 maryanmorel  staff    14K Nov 26 22:20 y_train.txt

In [7]:
!awk -F' ' '{print NF; exit}' data/train/X_train.txt


561

561 columns separated by spaces (' ')

Problem : there is one space in front of positive numbers (in order to get neat columns when you look at the text file), it messes up with pandas parser. We use unix command line utilities to replace ' ' (double spaces) it with ' ' (single space), and remove additional spaces before the first column :


In [3]:
!sed 's/^..//' data/train/X_train.txt > data/train/X_train_buffer.txt
!sed -e 's/  / /g' data/train/X_train_buffer.txt > data/train/X_train_clean.txt
!rm data/train/X_train_buffer.txt

In [7]:
!sed 's/^..//' data/test/X_test.txt > data/test/X_test_buffer.txt
!sed -e 's/  / /g' data/test/X_test_buffer.txt > data/test/X_test_clean.txt
!rm data/test/X_test_buffer.txt

In [34]:
X_train = pd.read_csv('./data/train/X_train_clean.txt', sep=' ', engine='c', header=None)
feat_names = flatten(pd.read_csv('data/features.txt', sep=' ', header=None, index_col=0).values.tolist())
X_train.columns = feat_names
X_test = pd.read_csv('./data/test/X_test_clean.txt', sep=' ', engine='c', header=None)
X_test.columns = feat_names
X_train.head(5)


Out[34]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-meanFreq() fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean)
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.074323 -0.298676 -0.710304 -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... 0.158075 -0.595051 -0.861499 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... 0.414503 -0.390748 -0.760104 -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... 0.404573 -0.117290 -0.482845 -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... 0.087753 -0.351471 -0.699205 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892

5 rows × 561 columns

Now read the labels :


In [35]:
classes = pd.read_csv('data/activity_labels.txt', header=None).values.tolist()
classes = [item for sublist in classes for item in sublist] # flatten list
y_train = pd.read_csv('data/train/y_train.txt', header=None)
y_test = pd.read_csv('data/test/y_test.txt', header=None)
y_train.columns = ["activity"]
y_test.columns = ["activity"]
y_train.head(5)


Out[35]:
activity
0 5
1 5
2 5
3 5
4 5

In [36]:
X_train.shape


Out[36]:
(7352, 561)

In [37]:
X_test.shape


Out[37]:
(2947, 561)

Data Analysis


In [38]:
X_train.describe()


Out[38]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... fBodyBodyGyroJerkMag-meanFreq() fBodyBodyGyroJerkMag-skewness() fBodyBodyGyroJerkMag-kurtosis() angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean)
count 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 ... 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000 7352.000000
mean 0.276993 -0.017695 -0.109141 -0.605438 -0.510938 -0.604754 -0.630512 -0.526907 -0.606150 -0.468604 ... 0.125293 -0.307009 -0.625294 0.008684 0.002186 0.008726 -0.005981 -0.489547 0.058593 -0.056515
std 0.059623 0.040811 0.056635 0.448734 0.502645 0.418687 0.424073 0.485942 0.414122 0.544547 ... 0.250994 0.321011 0.307584 0.336787 0.448306 0.608303 0.477975 0.511807 0.297480 0.279122
min 0.001189 -1.000000 -1.000000 -1.000000 -0.999873 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 ... -1.000000 -0.995357 -0.999765 -0.976580 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
25% 0.263429 -0.024863 -0.120993 -0.992754 -0.978129 -0.980233 -0.993591 -0.978162 -0.980251 -0.936219 ... -0.023692 -0.542602 -0.845573 -0.121527 -0.289549 -0.482273 -0.376341 -0.812065 -0.017885 -0.143414
50% 0.277227 -0.017219 -0.108676 -0.946196 -0.851897 -0.859365 -0.950709 -0.857328 -0.857143 -0.881637 ... 0.134000 -0.343685 -0.711692 0.009509 0.008943 0.008735 -0.000368 -0.709417 0.182071 0.003181
75% 0.288661 -0.010783 -0.097794 -0.242813 -0.034231 -0.262415 -0.292680 -0.066701 -0.265671 -0.017129 ... 0.289096 -0.126979 -0.503878 0.150865 0.292861 0.506187 0.359368 -0.509079 0.248353 0.107659
max 1.000000 1.000000 1.000000 1.000000 0.916238 1.000000 1.000000 0.967664 1.000000 1.000000 ... 0.946700 0.989538 0.956845 1.000000 1.000000 0.998702 0.996078 1.000000 0.478157 1.000000

8 rows × 561 columns

This is signal data from an experiment : very clean, very regular, no missing values


In [39]:
y_train.activity.value_counts()


Out[39]:
6    1407
5    1374
4    1286
1    1226
2    1073
3     986
dtype: int64

There is a slight class imbalance : Class 6 (resp. 2 and 3) is more (resp. less) represented than average representation. This could have a negative impact on our classifiaction, but oversampling seems complicated given our features, and undersampling is costly as we only have 7352 samples.

There are many variables. We do not hope any correlation plot : it would be useless (too much information in a limited space) plus it would a good way to explode the memory. Let's try a PCA to get a synthetic look at the data.


In [40]:
pca = PCA(n_components=10)
pca.fit(X_train)
print(pca.explained_variance_ratio_)
print("Variability of the compressed dataset %.4f" %sum(pca.explained_variance_ratio_))


[ 0.62555992  0.04913095  0.0412155   0.01874993  0.01694941  0.01272099
  0.01176714  0.01068966  0.00969418  0.00858032]
Variability of the compressed dataset 0.8051

Keeping three components seems to be reasonable, let's plot the data according to these three components


In [30]:
plot_PCA(X_train.values, y_train.values, classes, azimuth=59, elevation=-21)


[ 0.62555992  0.04913095  0.0412155 ]
Variability of the compressed dataset 0.7159

In [31]:
plot_PCA(X_train.values, y_train.values, classes, azimuth=-93, elevation=-7)


[ 0.62555992  0.04913095  0.0412155 ]
Variability of the compressed dataset 0.7159

There are two big clusters that can be separated by an hyperplane. Each of these big clusters are compound of three smaller overlapping clusters. The first cluster contains activities involving walking, while the second contains activities for which the subject stays still.

According to the frist three components of the PCA, each of the three walking activities seems to be well expressed by the data. Laying seems to be well separated from other still positions (as the gravity effect is measured on a different axis : the smartphone is horizontal while it is vertical for sitting and standing activities.). However, sitting and standing seems to be harder to seperate. This is quite intuitive, as they are quite similar activities from an accelerometer or gyroscope perspective.

Nonetheless, these figures gives us great hope of finding some good classifier to separate our data (remember we are only using the three first principal components, accounting for approximately 71.6% of the variability).

Having one component explaining 65.5% of the variability also accounts for some high correlation in X_train.

What are our options ?

  1. Use principal components as features
  2. Use Support Vector Classifier. As $n << d$ and the PCA shows easy-separated data, the linear kernel may be powerful enough to find some good separating hyperplane.
  3. Use regularized logistic regression (LASSO or Ridge)
  4. If we're still stuck, try SVM with non-linear kernel

Modelisation

We use cross validation on 5-Folds. Parameters are selected using F1-score as this performance measure avoid giving good scores to naive models in case of class imbalance. Performance of each model is evaluated on test set (train test consists of 70% of the data) looking at precision, recall and F1-score. For more information on cross validation on K-Folds, cf. The Elements of Statistical Learning (T. Hastie & al.). For more information about the performance measure we use, cf. :


In [47]:
K_folds = 5

1. Logistic regression

On the compressed dataset


In [43]:
pca = PCA(n_components=50)
X_redux = pca.fit_transform(X_train)
print("\n Variability of the compressed dataset: %.4f of the total variability \n" %sum(pca.explained_variance_ratio_))
logit = linear_model.LogisticRegression()
logit.fit(X_redux, np.ravel(y_train))
y_pred = logit.predict(X_redux) ## No fit_predict method
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(pca.transform(X_test))
print(metrics.classification_report(y_test, y_pred, target_names=classes))


 Variability of the compressed dataset: 0.9309 of the total variability 


 performance on train dataset : 

                      precision    recall  f1-score   support

           1 WALKING       0.98      0.99      0.98      1226
  2 WALKING_UPSTAIRS       0.98      0.98      0.98      1073
3 WALKING_DOWNSTAIRS       0.99      0.99      0.99       986
           4 SITTING       0.90      0.87      0.88      1286
          5 STANDING       0.88      0.91      0.90      1374
            6 LAYING       1.00      1.00      1.00      1407

         avg / total       0.95      0.95      0.95      7352



 performance on test dataset : 

                      precision    recall  f1-score   support

           1 WALKING       0.89      0.98      0.93       496
  2 WALKING_UPSTAIRS       0.92      0.86      0.89       471
3 WALKING_DOWNSTAIRS       0.93      0.92      0.93       420
           4 SITTING       0.94      0.85      0.89       491
          5 STANDING       0.88      0.94      0.91       532
            6 LAYING       1.00      1.00      1.00       537

         avg / total       0.93      0.93      0.93      2947

On the original dataset :


In [44]:
logit = linear_model.LogisticRegression()
logit.fit(X_train, np.ravel(y_train))
y_pred = logit.predict(X_train) ## No fit_predict method
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=classes))


 performance on train dataset : 

                      precision    recall  f1-score   support

           1 WALKING       1.00      1.00      1.00      1226
  2 WALKING_UPSTAIRS       1.00      1.00      1.00      1073
3 WALKING_DOWNSTAIRS       1.00      1.00      1.00       986
           4 SITTING       0.97      0.97      0.97      1286
          5 STANDING       0.98      0.98      0.98      1374
            6 LAYING       1.00      1.00      1.00      1407

         avg / total       0.99      0.99      0.99      7352



 performance on test dataset : 

                      precision    recall  f1-score   support

           1 WALKING       0.94      1.00      0.97       496
  2 WALKING_UPSTAIRS       0.97      0.95      0.96       471
3 WALKING_DOWNSTAIRS       1.00      0.97      0.98       420
           4 SITTING       0.97      0.88      0.92       491
          5 STANDING       0.90      0.97      0.94       532
            6 LAYING       1.00      1.00      1.00       537

         avg / total       0.96      0.96      0.96      2947

The model performs better without dimentionality reduction. As the time-complexity in the number of variable is reasonable considering the number of variables in our dataset, we decide to keep the original feature set.

Regarding performance, sitting and standing activities are harder to seperate as seen in the 3D plots. As sitting has a poor recall and standing a poor precision (compared to the other classes), we can deduce that lots of activites labeled as sitting are identified as standing. Resolving this issue may be hard.

There is some overfitting, thus we try a L1-regularized variant of logistic regression to prevent this.

L1 regularization

To prevent overfitting, we use L1 regularization (L2 regularization gives very similar results in our case). For more information about regularization and the influence of the parameter $C$, cf. http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression, or The Elements of Statistical Learning (T. Hastie & al.)


In [48]:
logit = linear_model.LogisticRegression()
C = np.logspace(-2, 4, 15) # Exponential exploration of the parameter space
params = {'penalty':['l1'], 'C':C} # L1 Penalty
train_scores, valid_scores = validation_curve(logit, X_train, np.ravel(y_train), \
                                              "C", C, cv=K_folds, verbose=0, scoring="f1", n_jobs=-1)
plot_validation_scores(train_scores, valid_scores, C)


The model still exhibits overfitting that cannot be reduced by penalizing the coefficients without decreasing the performance on cross validation samples. The cross-validation score is quite variable (+/- 2%) depending on the 5-Folds sampling.


In [49]:
logit = linear_model.LogisticRegression(penalty='l1', C=C[np.where(train_scores == np.max(train_scores))[0][0]])
print("Optimal C parameter : %.4f \n" %logit.C)
logit.fit(X_train, np.ravel(y_train))
y_pred = logit.predict(X_train) ## No fit_predict method
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=classes))


Optimal C parameter : 193.0698 


 performance on train dataset : 

                      precision    recall  f1-score   support

           1 WALKING       1.00      1.00      1.00      1226
  2 WALKING_UPSTAIRS       1.00      1.00      1.00      1073
3 WALKING_DOWNSTAIRS       1.00      1.00      1.00       986
           4 SITTING       1.00      1.00      1.00      1286
          5 STANDING       1.00      1.00      1.00      1374
            6 LAYING       1.00      1.00      1.00      1407

         avg / total       1.00      1.00      1.00      7352



 performance on test dataset : 

                      precision    recall  f1-score   support

           1 WALKING       0.94      1.00      0.96       496
  2 WALKING_UPSTAIRS       0.98      0.93      0.96       471
3 WALKING_DOWNSTAIRS       0.99      0.98      0.98       420
           4 SITTING       0.96      0.87      0.91       491
          5 STANDING       0.90      0.97      0.93       532
            6 LAYING       0.99      1.00      1.00       537

         avg / total       0.96      0.96      0.96      2947

As seen previously, the sitting and standing activities are note well separated. There is a still a huge overfit despite the regularization.

We try to improve these results by using a SVM classifier, as the hyperplane found by this model is often better at classification than the one found by logistic regression thanks to the margin maximization (for more precision, cf. The Elements of Statistical Learning T. Hastie & al.).

2. Support Vector Classifier

As the data is likely to be separable by an hyperplane in the original feature space, no need for nonlinear kernel here. We user linearSVC (SVC with linear kernel implementation of liblinear. Complexity is $O(np)$, i.e. this model is less complex than logistic regression that is $O(np^2)$. For more information on SVM and the influence of the parameter $C$, cf. http://scikit-learn.org/stable/modules/svm.html, or The Elements of Statistical Learning (T. Hastie & al.)


In [50]:
linSVC = svm.LinearSVC(dual=True, tol=0.0001, multi_class='ovr', fit_intercept=True, \
                intercept_scaling=1, verbose=0, random_state=42)
C = np.logspace(-3, 3, 15) # Exponential exploration of the parameter space
params = {'C':C}
train_scores, valid_scores = validation_curve(linSVC, X_train, np.ravel(y_train), \
                                              "C", C, cv=K_folds, verbose=0, scoring="f1", n_jobs=-1)
plot_validation_scores(train_scores, valid_scores, C)



In [60]:
linSVC.C = C[np.where(train_scores == np.max(train_scores))[0][0]]
print("Optimal C parameter : %.4f \n" %linSVC.C)
linSVC.fit(X_train, np.ravel(y_train))
y_pred = linSVC.predict(X_train)
print("\n performance on train dataset : \n")
print(metrics.classification_report(y_train, y_pred, target_names=classes))
print("\n\n performance on test dataset : \n")
y_pred = logit.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=classes))


Optimal C parameter : 19.3070 


 performance on train dataset : 

                      precision    recall  f1-score   support

           1 WALKING       1.00      1.00      1.00      1226
  2 WALKING_UPSTAIRS       1.00      1.00      1.00      1073
3 WALKING_DOWNSTAIRS       1.00      1.00      1.00       986
           4 SITTING       0.98      0.99      0.99      1286
          5 STANDING       0.99      0.99      0.99      1374
            6 LAYING       1.00      1.00      1.00      1407

         avg / total       1.00      1.00      1.00      7352



 performance on test dataset : 

                      precision    recall  f1-score   support

           1 WALKING       0.94      1.00      0.96       496
  2 WALKING_UPSTAIRS       0.98      0.93      0.96       471
3 WALKING_DOWNSTAIRS       0.99      0.98      0.98       420
           4 SITTING       0.96      0.87      0.91       491
          5 STANDING       0.90      0.97      0.93       532
            6 LAYING       0.99      1.00      1.00       537

         avg / total       0.96      0.96      0.96      2947

This linear SVC performs better than the logit, and is faster to compute.It still has a hard time to separate sitting and standing activities. In order to address this issue, we could try to :

  1. Try to obtain more data
  2. Improve the data representation of the signal contained in the dataset
  3. Improve the model

In [69]:
plot_learning_curve(linSVC, "Learning Curve (SVC)", X_train, np.ravel(y_train), cv=5, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 10))


As seen in the learning curves, the performance in terms of training example starts to level off at 5000+ training examples. Obtaining more data would help to reduce variance of our score, but may not be sufficient to improve the model performance dramatically as we would still have the standing/sitting representation overlap in our data.

Trying a nonlinear SVM (e.g. using a RBF kernel) would be very costly in terms of time-complexity as this model would be $O(n^3p)$ and would imply at least one more tuning parameter, making the cross-validation step way harder.

Further work on signal representation seems to be a more efficient way to improve the model, as we saw that these two classes seems to present serious overlapping with our features (cf. 3D plots). If we manage to find a representation in which these two signals would be easier to separate, this would solve our problem. This could by done by using spherical wavelets rather than Fourier transform in a scattering network fashion to perform the classification (cf. Stephane Mallat & al. work http://www.di.ens.fr/data/software/scatnet/).

Conclusion

Our best model is a linear Support Vector Classifier. This model can be seen as a first step toward an health monitoring application for smartphone or wearable. There are still some issues to face in order to launch a proper product :

  • In our dataset, the smartphone is fixed on the waist. In a real application, it would be in the pocket, and thus its position with respect to the body would vary. We should try to detect the position of the smartphone before the feature computation step in order to take accound of an eventual rotation.
  • It would be useful to detect if the smartphone is carried by the user, in order to avoid activity quantification when the smartphone is simply carried by a car.
  • We should try to recognize other activities that are quite common such as running or biking. These activities would be relevant for our health application.

Adressing these issues would add complexity to our task. In this fashion, scattering networks with spherical wavelet discussed above would be of some help.

The model we developped is time stable, thus it would not be useful to refresh it regularly. However, we should try to obtain a broader dataset in term of individuals in order to take account of the variety of human bodies & movements. As we are using smartphones to collect the data, obtaining more data points would not be very costly in term of aquisition or storage. If some retraining would be necessary, it could be done automatically as the whole parameter selection steps are done by our code without human intervention. Some checks should be programmed in order to verify that our exploration of the parameter set is wide enough, and to send alerts if there is some drop in performance.

As we use a SVM, the model is not very scalable, but as seen previously, this is not an issue as the learning curves show that increase in performance from additional data would not be very significant.

Such application could be thought as a smartphone app financed by advertising revenues, or as a SAAS (software as a service). In the SAAS fashion, the app would be accessed by an API to allow developpers to integrate some health analytics in their application. Pricing would be set according to the transactions volume (e.g. number of API requests).